This might save you an hour of your life

Solutions to errors I encountered in a Spark application

Fri, 05 Feb 2021

This post covers three (rather silly!) errors that I encountered in my PySpark application and that took me a bit of time to resolve. I'm writing down the details here in case someone stumbles upon them; it might save you some time.

TypeError: ‘NoneType’ object is not iterable

Before we blame Spark, this is a plain Python error and a frequent one. An iterable in Python is an object whose elements can be iterated over. Iterating over it means repeatedly asking its iterator for the next element (via the next function), and the iteration stops when the iterator is exhausted and raises StopIteration. A None value is not an iterable at all, so trying to iterate over one raises this TypeError.

One thing to remember is:

  • [None] is an iterable, since it is a list containing a single None value
  • None itself is not an iterable but a null value, as the short snippet after this list shows
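
A quick way to see the difference in plain Python, with no Spark involved:

for item in [None]:   # fine: a one-element list is iterable
    print(item)       # prints None

for item in None:     # raises TypeError: 'NoneType' object is not iterable
    print(item)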

Now, the transformations in Spark are essentially built-in functions that take an iterable and apply a function to each element, one at a time. Recreating the case where I got the error:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('test').setMaster('local')
sc = SparkContext(conf=conf)

def find_len(_id, x):
    # mapPartitionsWithIndex passes each partition as an iterator,
    # so materialise it first in order to take its length later
    x = list(x)
    for num in x:
        if num < 10:
            return None      # the culprit: None is not an iterable
    return [len(x)]          # partition functions must return an iterable

def filter_value(x):
    if x % 2 == 0:
        return x

sample_data = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
sample_data.mapPartitionsWithIndex(find_len).filter(filter_value).collect()

In the function find_len, if my partition has an element whose value is less than 10, I return None, and that return value is then used as the iterable feeding the next filter transformation on the RDD. That is where Python raises the error: the next step expects an iterable but gets None instead. The error can be avoided by returning an empty list [] (an empty iterable) instead of None, as in the sketch below.
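
For illustration, here is the adjusted version of find_len where every code path returns an iterable (the same toy function as above, only the None replaced by an empty list):

def find_len(_id, x):
    x = list(x)
    for num in x:
        if num < 10:
            return []        # empty iterable instead of None
    return [len(x)]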

ModuleNotFoundError: No module named ‘xxx’

This error shows up right after return pickle.loads(obj, encoding=encoding) in the stack trace, which is confusing. The reason is that we are using an imported class or function in our Spark application but only imported it the Python way, not the Spark way. This is where I got the error in my application:

from IRSSpark import IRSSparkJob

class CitiesCountJob(IRSSparkJob):
    """ Count the number of tax files from each city in States"""

    name = "CitiesCount"

My CitiesCountJob class inherits from IRSSparkJob, and I imported it just as we import a parent class in Python: from [module_name] import [class_name].

But this resulted in the ModuleNotFoundError. The reason is that Spark's submit command expects every imported .py file to be passed as a command-line argument when the application is launched, so that the module can be shipped to the executors. Running my application with the imported .py files passed as arguments resolved the error. The syntax is $SPARK_HOME/bin/spark-submit --py-files [path_to_imported_module] [path_to_app]
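
As an alternative to the command-line flag, PySpark also lets you ship a dependency from inside the application with SparkContext.addPyFile. A minimal sketch, assuming the imported module lives in a file called IRSSpark.py next to the app (the path here is only an example):

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('CitiesCount').setMaster('local')
sc = SparkContext(conf=conf)

# Ship the dependency to the executors so its classes can be unpickled there
sc.addPyFile('./IRSSpark.py')   # example path to the imported module

from IRSSpark import IRSSparkJob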

PySpark error: AttributeError: ‘NoneType’ object has no attribute ‘_jvm’

This one does not make any sense even if we stare at it for five minutes. I encountered this error while doing something like this:

from pyspark.sql.functions import *

def process_record(self, record):
    # intended to use Python's built-in filter over the record's field tuples
    city_name_col = list(filter(lambda x: (x[0] == 'CityNm'), record))

Both Python and Spark have a filter function. Python's built-in filter works on any iterable, while the filter exported by pyspark.sql.functions builds a Column expression for DataFrames (and RDDs have their own filter transformation). In this example I meant to use the built-in filter over a plain iterable, but the wildcard import from pyspark.sql.functions shadowed it, so Spark's version was called instead. That version has to reach the JVM through the active SparkContext, and inside this record-processing function no such context is available, so the lookup lands on None and raises the '_jvm' AttributeError. The error can be avoided by dropping the wildcard import from pyspark so that Python's built-in filter is used, as in the sketch below.
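
One way to keep the built-in available is to import only the pyspark functions you actually need instead of using a wildcard; a sketch (col here is just a placeholder for whatever function you really use):

from pyspark.sql.functions import col   # no wildcard, so built-ins are not shadowed

def process_record(self, record):
    # this is now guaranteed to be Python's built-in filter
    city_name_col = list(filter(lambda x: x[0] == 'CityNm', record))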

Ramsha Bukhari